[python] Fix read_paimon ArrowInvalid on PK tables with single-snapshot data#7820
Merged
JingsongLi merged 1 commit on May 12, 2026
Conversation
…ible splits
`Table.from_batches` rejects batches whose schema differs from the declared schema in the nullable bit: PK columns are marked NOT NULL in the Paimon schema, but the Parquet reader may produce nullable fields on certain pyarrow versions. Use `Table.cast` to align the batch schema before yielding to Ray; for nullable-only diffs this is a zero-copy, metadata-only operation. This fixes `read_paimon` crashing with `ArrowInvalid` on PK tables where all splits are raw-convertible (e.g. single-snapshot data with no overlapping keys).
Purpose
`read_paimon()` crashes with `pyarrow.lib.ArrowInvalid` when reading a primary-key table whose data consists of a single snapshot (all splits are raw-convertible). The issue is in `RayDatasource._get_read_task`: `schema` comes from `PyarrowFieldParser.from_paimon_schema` and marks PK columns as NOT NULL. The `batch` from the Parquet reader may have those columns as nullable; `from_batches` does a strict schema equality check (including the nullable bit) and rejects the mismatch.
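A minimal standalone reproduction of the mismatch (the single-column `pk` schema here is invented for illustration, not taken from the PR):

```python
import pyarrow as pa

# Declared schema, as PyarrowFieldParser.from_paimon_schema would produce it:
# the PK column is marked NOT NULL.
declared = pa.schema([pa.field("pk", pa.int64(), nullable=False)])

# Batch as it may come back from the Parquet reader: same column and type,
# but with the nullable bit set.
batch = pa.record_batch(
    [pa.array([1, 2, 3], type=pa.int64())],
    schema=pa.schema([pa.field("pk", pa.int64(), nullable=True)]),
)

# The strict schema equality check (including nullability) rejects this
# and raises pyarrow.lib.ArrowInvalid.
pa.Table.from_batches([batch], schema=declared)
```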
This is a pre-existing issue on master. It was never triggered by
existing tests because they all write multiple snapshots (creating
non-raw-convertible splits that go through the merge-read path, which
preserves nullability).
Linked Issue
Discovered while testing PR #7813 on CI (Python 3.10 / pyarrow in the
CI container triggers the strict check; newer pyarrow on local dev
machines is more lenient).
Fix
Replace the strict `from_batches([batch], schema=schema)` with a schema-aligning cast (see the sketch below). `Table.cast(target_schema)` is a zero-copy, metadata-only operation for nullable → not-null diffs. It also handles other type promotions (e.g. `large_string` → `string`) that may occur on some Ray versions. When the schemas already match, the `if` branch is skipped, so there is zero overhead.
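The original diff is not reproduced in this extract; the following is a minimal sketch of the change as described above (the `_to_table` helper name is hypothetical):

```python
import pyarrow as pa

def _to_table(batch: pa.RecordBatch, schema: pa.Schema) -> pa.Table:
    # Build the table from the batch's own schema first, then align it
    # with the declared Paimon schema only if they differ.
    table = pa.Table.from_batches([batch])
    if table.schema != schema:
        # Zero-copy for nullable -> not-null diffs; also covers benign
        # promotions such as large_string -> string.
        table = table.cast(schema)
    return table
```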
Tests
Added `test_read_paimon_pk_single_snapshot`: PK table + single write + `read_paimon()`; verifies no `ArrowInvalid` on raw-convertible splits.
All existing `ray_integration_test.py` tests remain green.
API & Format Impact
None. Pure internal fix in the Ray read task function.
Documentation Impact
None.
Generative AI Disclosure
Drafted with Claude Code assistance, reviewed and tested by the author.